ABSTRACT



Many online networks are measured and studied via sam- 
pling techniques, which typically collect a relatively small 
fraction of nodes and their associated edges. Past work in 
this area has primarily focused on obtaining a representative 
sample of nodes and on efficient estimation of local graph 
properties (such as node degree distribution or any node at- 
tribute) based on that sample. However, less is known about 
estimating the global topology of the underlying graph. 

In this paper, we show how to efficiently estimate the 
coarse-grained topology of a graph from a probability sample
of nodes. In particular, we consider that nodes are partitioned
into categories (e.g., countries or work/study places
in OSNs), which naturally defines a weighted category graph.
We are interested in estimating (i) the size of categories and 
(ii) the probability that nodes from two different categories 
are connected. For each of the above, we develop a fam- 
ily of estimators for design-based inference under uniform 
or non-uniform sampling, employing either of two measure- 
ment strategies: induced subgraph sampling, which relies 
only on information about the sampled nodes; and star sam- 
pling, which also exploits category information about the 
neighbors of sampled nodes. We prove consistency of these 
estimators and evaluate their efficiency via simulation on 
fully known graphs. We also apply our methodology to a 
sample of Facebook users to obtain a number of category 
graphs, such as the college friendship graph and the country 
friendship graph; we share and visualize the resulting data at 
www.geosocialmap.com.

Keywords 

Online Social Networks, coarse-grained topology, in- 
duced subgraph sampling, star sampling, Facebook. 

1. INTRODUCTION 

Many large online networks, such as online social net- 
works (OSNs) and the World Wide Web (WWW), are 

*We make our datasets available, together with a customizable
web-based visualization at www.geosocialmap.com

Figure 1: Nodes in the original graph ($G$) belong to
one of three categories: white, gray, and black. The
category graph ($G^C$) consists of three nodes,
corresponding to the three categories, connected by
weighted edges. The edge weight $w(\circ, \bullet)$ in $G^C$
is the probability that a black and a white node, randomly
chosen from $G$, are connected in $G$ (see Eq. (3)). The
main goal of this paper is to estimate these edge weights
based on a probability sample of nodes of $G$.

currently studied via sampling techniques. Sampling 
becomes necessary due to the sheer size of these net- 
works and/or access limitations, which make it infea- 
sible to collect (and, in some cases, to analyze) these 
networks in their entirety. 

Most principled graph sampling methods to date have 
focused on collecting a probability sample of nodes [6, 
19,20,30,35,51-53,60]. Based on such a sample, one 
can efficiently estimate many local graph properties, 
such as node attribute frequency, degree distribution, 
degree-degree correlations, or clustering coefficients [26, 
34]. However, these features reveal little about the 
global properties of the underlying graph, such as path- 
based properties (connectivity, diameter, average short- 
est path length) or community structure. 

In this paper, we show how a particular aspect of 
global network structure, namely coarse-grained topol- 
ogy, can be efficiently estimated from a probability sam- 
ple of nodes. Specifically, we note that nodes in many 
online graphs belong to categories, explicitly declared 
by users or clearly determined by observable character- 
istics. For example, in Facebook, users can officially 
declare the college or workplace with which they are
affiliated, or a country/city in which they live. Similarly,
in the WWW, all nodes can be categorized by their do- 
main names, and the users of Internet radio sites like 
Last.FM may be grouped on the basis of listening
behavior. This potentially allows us to build and study
category graphs, in which each node corresponds to a 
category and edge weights reflect the frequency of ties 
between category members in the original graph. We 
illustrate these concepts in Fig. 1.

The contribution of this paper lies in developing and 
evaluating several efficient estimators for two properties 
of the category graph, namely the size of the categories 
and the edge weights. These estimators take as input 
a uniform or non-uniform probability sample of nodes, 
measured via one of two strategies: induced subgraph 
sampling, in which we have information regarding only 
the sampled nodes; and star sampling, in which we also 
have category information about the neighbors of sam- 
pled nodes. We show that our estimators have good 
asymptotic properties (consistency, and hence asymp- 
totic unbiasedness) and we evaluate their efficiency via 
simulation: employing fully observed graphs from both 
synthetic and empirical sources, we examine how es- 
timator performance varies with the properties of the 
underlying graph. Finally, as a practical illustration 
of our approach, we apply our methodology to a sam- 
ple of Facebook nodes to estimate several Facebook 
category graphs, such as the inter-college and inter-country
friendship graphs. The resulting Facebook category graphs
are made available (along with a highly customizable,
web-based visualization service) at www.geosocialmap.com.

The structure of the remainder of the paper is as
follows. Section 2 presents the problem statement. Section 3
reviews node sampling techniques. Sections 4 and 5 present
our estimators for uniform and non-uniform probability
samples, respectively. Section 6 presents simulation results
on fully known graphs. Section 7 applies our estimators to
samples of Facebook. Section 8 reviews related work.
Section 9 concludes the paper. Finally, in the Appendix we
prove the consistency of all estimators proposed in this paper.

2. NOTATION AND PROBLEM STATEMENT

2.1 Basic graph G

We consider an undirected, static^1 graph $G = (V, E)$,
with $N = |V|$ nodes and $|E|$ edges. Denote by $\deg(v)$
the degree of node $v \in V$, and by

$$\mathrm{vol}(A) = \sum_{v \in A} \deg(v) \qquad (1)$$

the volume of a set of nodes $A \subseteq V$. We will often use

$$f_A = \frac{|A|}{|V|} \quad \text{and} \quad f_A^{\mathrm{vol}} = \frac{\mathrm{vol}(A)}{\mathrm{vol}(V)} \qquad (2)$$

to denote the relative size of $A$ in terms of number of
nodes and volume, respectively.

2.2 Category graph $G^C$

We assume that the set of nodes $V$ is partitioned
into a set $C$ of categories, i.e., that $\bigcup_{A \in C} A = V$
with distinct categories disjoint. We are interested in the
category graph $G^C = (C, E^C)$, with node set given by the
categories of $G$.^2 For two different categories $A, B \in C$,
$A \neq B$, denote by $E_{A,B} \subseteq E$ the corresponding
edge-cut in $G$, i.e.,

$$E_{A,B} = \{\{u, v\} \in E : u \in A \text{ and } v \in B\}.$$

If $|E_{A,B}| > 0$ then we draw an edge $\{A, B\}$ between $A$
and $B$ in $G^C$. We show an example of a category graph
in Fig. 1.

The way we have defined the category graph $G^C$ so far
prevents self-loops, but potentially allows for edge weights.
The weight $w(A, B)$ of edge $\{A, B\}$ can be defined in
a number of ways. For instance, one could trivially
set it always equal to 1. In some settings, e.g., statistical
modeling, the number of inter-category edges,
$w(A, B) = |E_{A,B}|$, is a useful choice. For many purposes,
however, it is useful to have a notion of edge weight that
adjusts for category size, e.g.,

$$w(A, B) = \frac{|E_{A,B}|}{|A| \cdot |B|}. \qquad (3)$$

This definition has an intuitive interpretation. Because
$|A| \cdot |B|$ is the size of the maximum possible edge-cut
from $A$ to $B$, $w(A, B)$ is equal to the probability that
a uniformly selected member of $A$ is connected to a
uniformly selected member of $B$. We give an example
of these weights $w(A, B)$ in Fig. 1.
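
As a concrete illustration of Eq. (3), the following minimal Python sketch computes exact category sizes and edge weights when the full graph is available. The function name `category_graph` and the edge-list and dictionary inputs are assumptions made for this example; they do not come from the paper.

```python
from collections import Counter

def category_graph(edges, category):
    """Exact category sizes and Eq. (3) weights for a fully known graph.

    edges    -- iterable of undirected edges (u, v), each listed once
    category -- dict mapping every node to its category label
    (Both input formats are illustrative assumptions.)
    """
    size = Counter(category.values())      # |A| for every category A
    cut = Counter()                        # |E_{A,B}| per unordered pair {A, B}
    for u, v in edges:
        a, b = category[u], category[v]
        if a != b:                         # intra-category edges are ignored
            cut[frozenset((a, b))] += 1
    weights = {}
    for pair, n_edges in cut.items():
        a, b = tuple(pair)
        weights[pair] = n_edges / (size[a] * size[b])
    return dict(size), weights
```

Pairs of categories with no connecting edge simply do not appear in `weights`, which matches the convention that an edge of the category graph is drawn only when the edge-cut is non-empty.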

2.3 Goal: Estimate $G^C$ through sampling

Given the full knowledge of graph G, it is trivial to 
construct the category graph with its edge weights. In 
many cases, however, the knowledge of the full graph G 
is not available, rendering exact computation of Eq. (3)
infeasible. For instance, downloading the entire Facebook
social graph via HTML scraping would require



^1 Sampling dynamic graphs is currently an active research
area [51,60,67], but is outside the scope of this paper. Indeed,
during the collection of the Facebook data sets we use, the
underlying graphs changed only insignificantly [20,35].
Moreover, in this paper we focus on coarse granularity, which
should change even more slowly in time, as argued in [67].



^2 We are not the first to be interested in coarse-grained
structures. See, e.g., the social network literature on
blockmodels [66], in which our categories correspond to positions,
our category graph to the reduced graph or block image, and
our edge weights to block densities or mixing rates. See
Section 8 for additional references.

downloading and processing about 50 terabytes of HTML
traffic [20], which is prohibitive in practice.

In contrast, it is often possible to collect a sample
$S \subseteq V$ of nodes of $G$. Note that we permit $S$ to
contain multiple copies of the same node, i.e., we allow
sampling with replacement. The challenge, then, and the main
goal of this paper, is to estimate the category graph $G^C$
based on the sample $S$.

3. SAMPLING 

Our methodology takes as input a probability sample
of nodes. Obtaining such a sample is an active research
topic in its own right (see Section 8). In Section 3.1
we briefly review the node sampling techniques that we
use later in our simulations and Facebook implementation.

Independently of the sampling technique employed,
we may collect more or less category information on
each sampled node. In Section 3.2 we describe the two
scenarios most common in practice. As we will see later,
they result in two different sets of estimators, often with
very different performance.

3.1 Node sampling techniques 

3.1.1 Independence Sampling 

Under independence sampling, we sample nodes
independently from the set $V$, with replacement. We
distinguish two general cases: Uniform Independence Sampling
(UIS), where sampling probabilities are uniform
(the same for all nodes); and Weighted Independence
Sampling (WIS), which samples $v$ with probability
proportional to a known weight $w(v)$.

In general, UIS and WIS are not feasible in online 
networks because of the lack of a sampling frame. For
example, the list of all user IDs may not be publicly 
available, or the user ID space may be too sparsely 
allocated to permit rejection sampling. Nevertheless, 
these techniques can occasionally be employed, either 
when permitted by fortuitous circumstances (see e.g., 
use by [19,20]) or when deliberately "down-sampling" a 
large graph to speed analysis. Independence samplers 
are also conceptually important as a baseline for com- 
parison with crawling-based sampling methods. 

3.1.2 Sampling via Crawling 

In contrast to independence sampling, crawling tech- 
niques are feasible in many online networks, and are 
thus the main focus of this paper. The crawling meth- 
ods described here lead to an approximate probability 
sample (asymptotically approaching UIS or WIS) from 
the node set, in the limit of increasing sample size. 

Simple Random Walk (RW) [41] selects the next-hop
node $v$ uniformly at random among the neighbors of the
current node $u$. On a connected and aperiodic graph,
RW samples node $v$ with probability proportional
to its degree $\deg(v)$.



Figure 2: Observed categories and edges, under
two scenarios we study in this paper: (a) induced subgraph
sampling and (b) star sampling. The legend distinguishes
sampled nodes, unsampled nodes with known category,
unknown nodes, and observed edges.

Weighted Random Walk (WRW) is RW on a weighted 
graph [5]. In our simulations and implementation, we 
use "Stratified WRW," or S-WRW [35], i.e., a version 
of WRW that increases the sampling efficiency by over- 
sampling graph regions relevant to the measurement ob- 
jective and under-sampling the irrelevant ones. 

Metropolis-Hastings Random Walk (MHRW) is a
version of the random walk that modifies the transition
probabilities to converge to a desired stationary distribution
(often uniform). It was shown in [20,51] that RW
outperforms MHRW for most applications, which we
observe in our implementation as well.
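
The two basic walks can be sketched as follows. This is a minimal illustration, assuming the graph is held in memory as an adjacency dictionary `adj` (node to list of neighbors); the function names are hypothetical, and a real crawler would query the online network instead. The MHRW acceptance rule min(1, deg(u)/deg(v)) is the standard choice for a uniform target distribution and is an assumption here, since the text does not spell it out.

```python
import random

def random_walk(adj, start, n, rng=random):
    """Simple RW: the next hop is uniform among the current node's neighbors."""
    walk, u = [], start
    for _ in range(n):
        u = rng.choice(adj[u])
        walk.append(u)
    return walk

def mhrw(adj, start, n, rng=random):
    """Metropolis-Hastings RW with a uniform target distribution."""
    walk, u = [], start
    for _ in range(n):
        v = rng.choice(adj[u])                 # propose a uniform neighbor
        # Accept with probability min(1, deg(u) / deg(v)); otherwise stay at u.
        if rng.random() < len(adj[u]) / len(adj[v]):
            u = v
        walk.append(u)                         # a rejected move repeats u
    return walk
```

Both functions return a list that may contain repeated nodes, which is consistent with treating the sample as a multiset.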

3.2 Observed categories and edges 

Our estimators will make use of every fully observed
edge, i.e., every edge $\{u, v\}$ for which we know the
categories of both $u$ and $v$. We distinguish between two
measurement scenarios [34] that yield different sets of
observed edges, as follows.

3.2.1 Induced Subgraph Sampling 

Under induced subgraph sampling, we learn the
categories of the sampled nodes only. Consequently, the
observed edges are only the edges induced on the set $S$
of sampled nodes, as shown in Fig. 2(a).

3.2.2 Star Sampling 

In some settings, sampling a node $u \in S$ reveals the
categories of all its neighbors (not only the neighbors
in $S$). This is typically the case when sampling is done
through scraping the HTML pages of OSNs [20,35]. We
refer to this as star sampling,^3 and we show an example
in Fig. 2(b).

Finally, we emphasize that star sampling requires only 
information about neighbors' categories; their degree or 
friend list is not needed, nor ties among neighbors (as 
in complete egonet sampling [66]). 

^3 To be precise, following the terminology of [34], this is
labeled star sampling. Unlabeled star sampling obtains only
the total number of neighbors, without their identities or
categories.

4. UNIFORM SAMPLING

In this section, we provide design-based estimators for 
category sizes and category graph edge weights, given a 
uniform independence (UIS) sample from the node set. 
All estimators shown in this section and in Section 5
are consistent; proofs are provided in the Appendix. 

4.1 Estimating category size ($|A|$)

Learning the size of a given category can be an
important measurement objective per se. Moreover, it is
also a building block of the edge weight estimators we
derive in Section 4.2.2.

4.1.1 Induced subgraph sampling 

The size $|A|$ of category $A$ can be trivially estimated
by multiplying by $N$ the fraction of nodes sampled in
$A$, i.e.,

$$\widehat{|A|} = N \cdot \frac{|S_A|}{|S|}, \qquad (4)$$

where

$$S_A = \{v \in S : v \in A\}$$

is a multiset containing all samples from category $A$.
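
A minimal sketch of Eq. (4) in Python follows. It assumes the sample is a list of node identifiers (repeats allowed) and that `category` maps each sampled node to its label; the function name is hypothetical.

```python
def size_induced_uis(sample, category, A, N):
    """Eq. (4): N * |S_A| / |S| for a uniform independence sample.

    sample   -- list of sampled nodes, possibly with repeats (a multiset)
    category -- dict mapping each sampled node to its category
    A        -- the category whose size is estimated
    N        -- population size |V| (assumed known here)
    """
    s_a = sum(1 for v in sample if category[v] == A)   # |S_A|, counting repeats
    return N * s_a / len(sample)
```

For instance, if half of the sampled nodes fall in category `A` and `N = 100`, the estimate is 50.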

4.1.2 Star sampling 

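Although not obvious at first blush, star sampling
gives us an alternative way to estimate category sizes.
Denote by

$$k_A = \frac{\mathrm{vol}(A)}{|A|} \quad \text{and} \quad k_V = \frac{\mathrm{vol}(V)}{|V|}$$

the average node degree in category $A$ and in the
entire graph $G$, respectively. Because $\mathrm{vol}(A) = |A| \cdot k_A$,
we can rewrite the relative volume $f_A^{\mathrm{vol}}$ of category $A$
(see Eq. (2)) as

$$f_A^{\mathrm{vol}} = \frac{\mathrm{vol}(A)}{\mathrm{vol}(V)} = \frac{|A| \cdot k_A}{|V| \cdot k_V} = \frac{|A| \cdot k_A}{N \cdot k_V}.$$

This allows us to estimate the size $|A|$ of category $A$ as

$$\widehat{|A|} = N \cdot \hat{f}_A^{\mathrm{vol}} \cdot \frac{\hat{k}_V}{\hat{k}_A}. \qquad (5)$$

This formula may seem less attractive than Eq. (4),
because we now have to estimate three different numbers.
However, $k_V$ and $k_A$ can be easily estimated,
respectively by

$$\hat{k}_V = \frac{\sum_{v \in S} \deg(v)}{|S|} \quad \text{and} \quad \hat{k}_A = \frac{\sum_{v \in S_A} \deg(v)}{|S_A|}. \qquad (6)$$

Similarly, $f_A^{\mathrm{vol}}$ could be estimated by

$$\hat{f}_A^{\mathrm{vol}} = \frac{\sum_{v \in S} \deg(v) \cdot 1_{\{v \in A\}}}{\sum_{v \in S} \deg(v)}.$$

But we have proposed in [35] a much more efficient
star-based estimator of $f_A^{\mathrm{vol}}$, i.e.,

$$\hat{f}_A^{\mathrm{vol}} = \frac{1}{\sum_{s \in S} \deg(s)} \sum_{s \in S} \sum_{v \in \mathcal{N}(s)} 1_{\{v \in A\}}, \qquad (7)$$

where $\mathcal{N}(s)$ denotes the set of neighbors of $s$.

By plugging Eq. (6) and Eq. (7) into Eq. (5), we obtain a
complex yet powerful star-based estimator of size $|A|$.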

We show later that the star sampling estimator of
Eq. (5) often outperforms the trivial estimator of Eq. (4),
especially in dense graphs. One reason for this result
is that Eq. (4) employs only the number $|S_A|$ of samples
from $A$. This number is a random variable with
a potentially high variance (especially for walks). In
contrast, Eq. (5) relies on mean degree estimates rather
than on counts; the degree-based estimates employ more
information (edges not in $G[S]$) and tend to be more
stable.
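
The star-based size estimator obtained by combining Eqs. (5)-(7) can be sketched as below. This is an illustration under stated assumptions: `adj[v]` holds the full neighbor set of every sampled node (as star sampling provides), `category` covers sampled nodes and their neighbors, every sampled node has at least one neighbor, and the function name is hypothetical.

```python
def size_star_uis(sample, adj, category, A, N):
    """Star-based estimate of |A| from a UIS sample, following Eqs. (5)-(7)."""
    s_a = [v for v in sample if category[v] == A]
    if not s_a:
        # k_A in Eq. (6) is undefined without at least one sample from A.
        raise ValueError("no sampled node belongs to category A")
    vol_s = sum(len(adj[v]) for v in sample)              # total degree of S
    k_v = vol_s / len(sample)                             # Eq. (6), whole graph
    k_a = sum(len(adj[v]) for v in s_a) / len(s_a)        # Eq. (6), category A
    # Eq. (7): share of the neighbors of sampled nodes that fall in A.
    in_a = sum(1 for s in sample for v in adj[s] if category[v] == A)
    f_vol = in_a / vol_s
    return N * f_vol * k_v / k_a                          # Eq. (5)
```

As a sanity check, when the sample contains every node exactly once, all three plug-in quantities are exact and the function returns the true category size.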

4.2 Estimating category edge weights ($w(A, B)$)

Recall from Eq. (3) that, given the full knowledge of
graph $G$, the weight $w(A, B)$ is obtained by dividing
the number of edges between $A$ and $B$ by the maximal
possible number of such edges. We use this same
idea when estimating $w(A, B)$ from our sample $S$,
except that now we divide the number of edges observed
between $A$ and $B$ by the maximal number of such edges
we could potentially observe.

4.2.1 Induced subgraph sampling

Under induced subgraph sampling, we observe edges
between the sampled nodes only. Consequently, in our
sample we observe $\sum_{a \in S_A} \sum_{b \in S_B} 1_{\{\{a,b\} \in E\}}$
edges between distinct categories $A$ and $B$, out of the
maximal number $|S_A| \cdot |S_B|$ we could possibly observe,
leading to the trivial estimator

$$\hat{w}(A, B) = \frac{\sum_{a \in S_A} \sum_{b \in S_B} 1_{\{\{a,b\} \in E\}}}{|S_A| \cdot |S_B|}. \qquad (8)$$

(Note that when $S$ contains the same node multiple
times, we count any corresponding sampled edges
multiple times as well.)
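
Eq. (8) translates almost directly into code. The sketch below assumes that `adj[a]` contains at least the sampled neighbors of each sampled node `a`; the function name and the choice to return `None` when either category has no samples are assumptions made for illustration.

```python
def weight_induced_uis(sample, adj, category, A, B):
    """Eq. (8): observed A-B edges among sampled nodes over |S_A| * |S_B|."""
    s_a = [v for v in sample if category[v] == A]
    s_b = [v for v in sample if category[v] == B]
    if not s_a or not s_b:
        return None   # undefined without samples from both categories
    # Repeated samples contribute repeatedly, as noted in the text.
    hits = sum(1 for a in s_a for b in s_b if b in adj[a])
    return hits / (len(s_a) * len(s_b))
```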

4.2.2 Star sampling 

Under star sampling, on sampling node $a \in A$ we
observe the set $E_{a,B} \subseteq E$ of all edges between $a$ and
category $B \neq A$. So we observe $|E_{a,B}|$ edges out of a
potential $|B|$ edges between $a$ and $B$. If we consider all
nodes $S_A$ we sampled from $A$, we observe $\sum_{a \in S_A} |E_{a,B}|$
out of a potential $|S_A| \cdot |B|$ edges. The same applies
to nodes $S_B$ sampled in $B$ and their neighbors in $A$.
Consequently, we can estimate the category graph edge
weight $w(A, B)$ by dividing the total number of edges
we observed between $A$ and $B$ by our estimate of the
maximal number we could potentially observe, i.e.,

$$\hat{w}(A, B) = \frac{\sum_{a \in S_A} |E_{a,B}| + \sum_{b \in S_B} |E_{b,A}|}{|S_A| \cdot \widehat{|B|} + |S_B| \cdot \widehat{|A|}}. \qquad (9)$$

Note that because we usually do not know the real sizes
of $A$ and $B$, Eq. (9) uses their estimators $\widehat{|A|}$ and $\widehat{|B|}$.
We can employ either Eq. (4) or Eq. (5), as needed.

Observe that the star sampling estimator is potentially
more efficient than the trivial induced subgraph
estimator, because we include edges (and non-edges) 
between sampled members of A and B and members of 
the respective sets that were not themselves sampled. 
For categories with large mean degree, this may rep- 
resent a substantial increase in information versus the 
induced subgraph case. 
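
The star-based weight estimator of Eq. (9) might be sketched as follows. The category size estimates are passed in as arguments, so either Eq. (4) or Eq. (5) can supply them; the function name and argument layout are assumptions for this example, and `adj[v]` is assumed to hold the full neighbor set of every sampled node.

```python
def weight_star_uis(sample, adj, category, A, B, size_a, size_b):
    """Eq. (9): star-sampling estimate of w(A, B) under UIS.

    size_a, size_b -- estimates (or true values) of |A| and |B|
    """
    s_a = [v for v in sample if category[v] == A]
    s_b = [v for v in sample if category[v] == B]
    potential = len(s_a) * size_b + len(s_b) * size_a
    if potential == 0:
        return None   # no sampled node in either category
    # |E_{a,B}| is the number of neighbors of a that lie in B, and vice versa.
    observed = sum(1 for a in s_a for v in adj[a] if category[v] == B)
    observed += sum(1 for b in s_b for v in adj[b] if category[v] == A)
    return observed / potential
```

Unlike the induced subgraph case, the estimate is defined as soon as a single node from either category has been sampled.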

4.3 Population size (N) 

In our estimation of category sizes, the population 
size $N = |V|$ is required. In some cases $N$ is known (e.g.,
in an OSN context, it may be published by the service 
provider), but in general this is not the case. Fortu- 
nately, where N is not available, we can turn to esti- 
mation. For instance, [33] proposes an approach based 
on a "reversed coupon collector" problem, which can be 
used with both uniform and non-uniform sampling. 

Finally, we note that N is only necessary where abso- 
lute values of category sizes are required. Specifically, 
all edge weights and category sizes can be estimated up 
to a constant of proportionality without knowing the 
size of the total population. Thus, if we are interested 
in ratios of category sizes and/or edge weights (e.g., the
relative weight of the $A, B$ connection versus the $A, C$
connection in $G^C$), then $N$ can be ignored (and replaced
by an arbitrary constant in the above equations). 

5. NON-UNIFORM SAMPLING 

The estimators derived in Section 4 hold under UIS,
where every node $v \in V$ is sampled with the same
probability. Such a sampling design is rarely feasible in
practice. Moreover, in some cases UIS may also be
undesirable, e.g., when some categories are irrelevant to our
measurement [35].

A more common scenario is non-uniform probability
sampling, where every node $v \in V$ is sampled with
probability proportional to a known weight $w(v)$.
Indeed, this is the case for WIS, RW, S-WRW and other
principled walk-based sampling methods, provided that 
samples have adequately converged [20]. Non-uniform 
samples are by definition biased towards nodes of higher 
weight (typically degree), which may dramatically dis- 
tort the estimation results if used without correcting for 
sampling probabilities [21]. 

Fortunately, where sampling weights are known (as in
the above designs), the resulting bias can be corrected by an
appropriate (though not necessarily obvious) re-weighting
of the measured values. In this section, we rewrite all
estimators from Section 4 in such a corrected form.

5.1 Correcting for sample bias 

The bias of a weighted sample can be corrected using the
Hansen-Hurwitz estimator [25], as shown, e.g., in [56,65] for
random walks and also used in [51]. Let every node $v \in V$
carry a value $x(v)$. We can estimate the population
total $x_{tot} = \sum_{v \in V} x(v)$ by

$$\hat{x}_{tot} = \frac{1}{n} \sum_{v \in S} \frac{x(v)}{\pi(v)}, \qquad (10)$$

where $n = |S|$ is the number of samples and $\pi(v)$ is the
sampling probability of node $v$.

In practice, we usually know $\pi(v)$, and thus $\hat{x}_{tot}$, only
up to a constant, i.e., we know the (non-normalized)
weights $w(v) \propto \pi(v)$. Fortunately, we can often
address this problem by estimating the ratio of two
totals, which makes the unknown constants cancel out.
We will use this approach below.
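
To make the ratio idea concrete, here is a small sketch: the ratio of two Hansen-Hurwitz totals is computed from non-normalized weights, and the unknown constant drops out. The helper name `hh_ratio` and the toy degree data are assumptions chosen for illustration; the example estimates the mean degree from a degree-proportional sample.

```python
def hh_ratio(sample, x, y, w):
    """Ratio of two Hansen-Hurwitz totals: sum(x/w) / sum(y/w).

    x, y, w are functions of a node. The 1/n factors of Eq. (10) and the
    unknown normalizing constant of w cancel in the ratio.
    """
    num = sum(x(v) / w(v) for v in sample)
    den = sum(y(v) / w(v) for v in sample)
    return num / den

# Toy example: two nodes with degrees 1 and 3, sampled proportionally to degree.
deg = {"a": 1, "b": 3}
sample = ["a", "b", "b", "b"]
# Total degree over total node count gives the mean degree.
mean_degree = hh_ratio(sample, deg.get, lambda v: 1, deg.get)
```

Multiplying every weight by the same constant leaves `mean_degree` unchanged, which is why knowing the weights only up to a constant is sufficient.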

5.2 Estimating category size ($|A|$)

5.2.1 Induced subgraph sampling 

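Following Eq. (10), we can re-weight the count $|S_A|$ by
setting $x(v) \equiv 1_{\{v \in A\}}$. This replaces $|S_A|$ by
$\frac{1}{n} \sum_{v \in S} \frac{1_{\{v \in A\}}}{\pi(v)} = \frac{1}{n} \sum_{v \in S_A} \frac{1}{\pi(v)}$.
Analogously, $|S|$ is replaced by $\frac{1}{n} \sum_{v \in S} \frac{1}{\pi(v)}$.
Consequently, we can rewrite Eq. (4) as

$$\widehat{|A|} = N \cdot \frac{\sum_{v \in S_A} 1/\pi(v)}{\sum_{v \in S} 1/\pi(v)} = N \cdot \frac{\sum_{v \in S_A} 1/w(v)}{\sum_{v \in S} 1/w(v)} = N \cdot \frac{w^{-1}(S_A)}{w^{-1}(S)}, \qquad (11)$$

where

$$w^{-1}(X) = \sum_{v \in X} \frac{1}{w(v)}$$

is a 're-weighted size' of multiset $X \subseteq V$.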
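
A sketch of Eq. (11) follows, assuming `w` is a dictionary of positive, possibly non-normalized sampling weights for the sampled nodes (for a random walk, the node degrees). The function name is hypothetical.

```python
def size_induced_weighted(sample, category, w, A, N):
    """Eq. (11): N * w^{-1}(S_A) / w^{-1}(S), where w^{-1}(X) = sum of 1/w(v)."""
    inv_all = sum(1.0 / w[v] for v in sample)                       # w^{-1}(S)
    inv_a = sum(1.0 / w[v] for v in sample if category[v] == A)     # w^{-1}(S_A)
    return N * inv_a / inv_all
```

With all weights equal, the expression reduces to the uniform estimator of Eq. (4).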

5.2.2 Star sampling 

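As in Section 4.1.2, we estimate the size of a category
$A$ using Eq. (5), i.e.,

$$\widehat{|A|} = N \cdot \hat{f}_A^{\mathrm{vol}} \cdot \frac{\hat{k}_V}{\hat{k}_A}. \qquad (12)$$

However, now, the terms $\hat{f}_A^{\mathrm{vol}}$, $\hat{k}_V$ and $\hat{k}_A$ must be
calculated taking into account the sampling weights. Indeed,
the weighted version of $\hat{f}_A^{\mathrm{vol}}$ is (after [35])

$$\hat{f}_A^{\mathrm{vol}} = \frac{1}{\sum_{s \in S} \frac{\deg(s)}{w(s)}} \cdot \sum_{s \in S} \frac{1}{w(s)} \sum_{v \in \mathcal{N}(s)} 1_{\{v \in A\}}, \qquad (13)$$

where $\mathcal{N}(s)$ denotes the set of neighbors of $s$.
Similarly, the estimators Eq. (6) of $k_V$ and $k_A$ can be
rewritten respectively as

$$\hat{k}_V = \frac{\sum_{v \in S} \frac{\deg(v)}{w(v)}}{w^{-1}(S)} \quad \text{and} \quad \hat{k}_A = \frac{\sum_{v \in S_A} \frac{\deg(v)}{w(v)}}{w^{-1}(S_A)}. \qquad (14)$$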
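
Combining Eqs. (12)-(14) gives the weighted star-based size estimator, sketched below under the same assumptions as before: `adj[v]` is the full neighbor set of each sampled node, `w` holds positive sampling weights, every sampled node has at least one neighbor, and the function name is hypothetical.

```python
def size_star_weighted(sample, adj, category, w, A, N):
    """Weighted star-based estimate of |A|, following Eqs. (12)-(14)."""
    s_a = [v for v in sample if category[v] == A]
    if not s_a:
        raise ValueError("no sampled node belongs to category A")
    inv_all = sum(1.0 / w[v] for v in sample)             # w^{-1}(S)
    inv_a = sum(1.0 / w[v] for v in s_a)                  # w^{-1}(S_A)
    deg_w = sum(len(adj[v]) / w[v] for v in sample)       # sum of deg(s)/w(s)
    # Eq. (13): re-weighted share of neighbors that fall in A.
    f_vol = sum(1.0 / w[s] for s in sample
                for v in adj[s] if category[v] == A) / deg_w
    k_v = deg_w / inv_all                                 # Eq. (14)
    k_a = sum(len(adj[v]) / w[v] for v in s_a) / inv_a    # Eq. (14)
    return N * f_vol * k_v / k_a                          # Eq. (12)
```

A quick check: if each node appears in the sample exactly as many times as its degree and `w` equals the degree, the re-weighting undoes the bias exactly and the true size is returned.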

5.3 Estimating category edge weights (w(A, B)) 

5.3.1 Induced subgraph sampling

Note that in the numerator of Eq. (8), we have a sum
over node pairs, rather than single nodes. In this case, the
Hansen-Hurwitz estimator divides every component by
the product of weights of the two involved nodes [34],
which yields

$$\hat{w}(A, B) = \frac{\sum_{a \in S_A} \sum_{b \in S_B} \frac{1_{\{\{a,b\} \in E\}}}{w(a) \cdot w(b)}}{w^{-1}(S_A) \cdot w^{-1}(S_B)}. \qquad (15)$$
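
Eq. (15) can be sketched in the same style. The function name and the `None` return for an empty category are assumptions; `w` is a dictionary of positive sampling weights.

```python
def weight_induced_weighted(sample, adj, category, w, A, B):
    """Eq. (15): pairwise re-weighted induced-subgraph estimate of w(A, B)."""
    s_a = [v for v in sample if category[v] == A]
    s_b = [v for v in sample if category[v] == B]
    if not s_a or not s_b:
        return None   # undefined without samples from both categories
    # Each observed pair is divided by the product of the two node weights.
    num = sum(1.0 / (w[a] * w[b]) for a in s_a for b in s_b if b in adj[a])
    den = sum(1.0 / w[a] for a in s_a) * sum(1.0 / w[b] for b in s_b)
    return num / den
```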



5.3.2 Star sampling 
Finally, under non-uniform sampling, Eq. (9) becomes

$$\hat{w}(A, B) = \frac{\sum_{a \in S_A} \frac{|E_{a,B}|}{w(a)} + \sum_{b \in S_B} \frac{|E_{b,A}|}{w(b)}}{w^{-1}(S_A) \cdot \widehat{|B|} + w^{-1}(S_B) \cdot \widehat{|A|}}. \qquad (16)$$

Again, we have two size estimators, Eq. (11) and Eq. (12),
to choose from to plug into $\widehat{|A|}$ and $\widehat{|B|}$. We recommend
selecting the one with smaller variance for the specific
application. This variance can be estimated, e.g., using
bootstrapping [9].
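
The weighted star estimator of Eq. (16) is sketched below. As in the uniform case, the size estimates are supplied by the caller (from Eq. (11) or Eq. (12)); the function name and argument layout are illustrative assumptions.

```python
def weight_star_weighted(sample, adj, category, w, A, B, size_a, size_b):
    """Eq. (16): re-weighted star-sampling estimate of w(A, B)."""
    s_a = [v for v in sample if category[v] == A]
    s_b = [v for v in sample if category[v] == B]
    inv_a = sum(1.0 / w[v] for v in s_a)          # w^{-1}(S_A)
    inv_b = sum(1.0 / w[v] for v in s_b)          # w^{-1}(S_B)
    den = inv_a * size_b + inv_b * size_a
    if den == 0:
        return None   # no sampled node in either category
    # |E_{a,B}| / w(a) summed over S_A, plus the symmetric term over S_B.
    num = sum(sum(1 for v in adj[a] if category[v] == B) / w[a] for a in s_a)
    num += sum(sum(1 for v in adj[b] if category[v] == A) / w[b] for b in s_b)
    return num / den
```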

5.4 Sampling via crawling 

As we argued in Section 3.1.2, in many online networks
the only feasible sampling approach is via crawling.
Such techniques result in non-uniform sampling
probabilities, and, consequently, sampling weights. For
example, under RW the sampling weights converge
asymptotically to $w(v) = \deg(v)$ [41]. Using these weights in
conjunction with the WIS estimators above allows for
consistent estimation of coarse-grained topology from
random walk samples.

Of course, consecutive samples collected by crawls 
are in general correlated, which can potentially affect 
the efficiency of our estimators. One way to deal with 
that is to take, say, every T-th sample. For T large 
enough, this thinning technique effectively reduces sam- 
ple correlations, at a cost of discarding a large portion 
of available information. Thinning is crucial in some 
applications, e.g., those primarily based on counting 
repeated nodes, as in [33]. The ergodicity of standard 
random walk designs, however, guarantees convergence 
to the target (WIS) distribution with any effect of au- 
tocorrelation vanishing in the limit of sample size. (See 
Appendix.) 



6. SIMULATION RESULTS 

6.1 Objective and performance metrics 

In this section, we apply our methodology to fully ob- 
served graphs from both synthetic and empirical sources. 
Our objective is to evaluate estimator performance by 
comparing the (known) values of the category sizes and 
edge weights in each case with the values inferred using
our methods. We use the Normalized Root Mean
Square Error (NRMSE) to assess estimation error:

$$\mathrm{NRMSE}(\hat{x}) = \frac{\sqrt{E\left[(\hat{x} - x)^2\right]}}{x}, \qquad (17)$$

where $x$ is the real value and $\hat{x}$ is the estimate.
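
In a simulation, the expectation in Eq. (17) would typically be approximated by averaging over repeated runs. A minimal sketch, assuming `estimates` is a list of estimates from independent runs and `truth` is the known non-zero true value (the function name is hypothetical):

```python
import math

def nrmse(estimates, truth):
    """Empirical version of Eq. (17): sqrt(mean squared error) / true value."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth
```

For example, two runs that return 9 and 11 for a true value of 10 give an NRMSE of 0.1.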



6.2 Generated topologies 

First, we consider synthetic graphs. By simulating G, 
we control many crucial parameters (such as graph den- 
sity, or category size and tightness) and study the effect 
of these parameters on the efficiency of our estimators. 

6.2.1 Graph model 

We consider a graph $G$ with $N = 88{,}850$ nodes
partitioned into 10 categories. Their sizes range from 50
to 50,000 nodes. Initially, nodes in each category form
a $k$-regular random graph, with the average node
degree ranging from $k = 5$ to $k = 49$. In addition, we
add $N \cdot k / 10$ random edges between nodes in different
categories. The resulting graph $G$ is connected (in all
instances we used) and has $|E| = 0.6 \cdot N \cdot k$ edges. By
construction, $G$ has a very strong community structure.
In order to study the effect of community tightness, we
next randomly permute the category labels of a
fraction $\alpha \in [0, 1]$ of nodes. For $\alpha = 0$, node categories
follow the strong community structure, whereas for $\alpha = 1$
the categories are completely independent of the graph
structure.
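
The label permutation step can be sketched as follows. This is one plausible reading of the procedure, not necessarily the authors' exact implementation: a fraction `alpha` of the nodes is chosen uniformly at random and their labels are shuffled among themselves, which preserves all category sizes. The function name is hypothetical.

```python
import random

def permute_labels(category, alpha, rng=random):
    """Randomly permute the category labels of a fraction alpha of the nodes."""
    nodes = list(category)
    chosen = rng.sample(nodes, round(alpha * len(nodes)))
    labels = [category[v] for v in chosen]
    rng.shuffle(labels)                    # permute labels among the chosen nodes
    permuted = dict(category)              # leave the input mapping untouched
    permuted.update(zip(chosen, labels))
    return permuted
```

Because labels are only exchanged between nodes, the category sizes used as ground truth stay the same for every value of `alpha`.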

6.2.2 Category sizes 

We first study the efficiency of the category size
estimators, Eq. (4) and Eq. (5). We present the results in the
top row of Fig. 3 and make the following observations.

In all of our simulated cases, all estimators converge
to the true value as sample size increases. Moreover,
the star estimator performs better than the induced
subgraph estimator, although its efficiency can depend
on properties of $G$. For example, (i) the denser the
graph, the better the star estimator is (Fig. 3(a)), but
(ii) its efficiency can be limited when clustering closely
follows the category structure (Fig. 3(b)). In contrast,
the induced subgraph estimator is not affected by any
of these properties. We also observe that both estimators
perform better for larger categories (Fig. 3(c)). In
Fig. 3(d), we show the CDF of the NRMSE of all (ten)
estimators of the category sizes.

Figure 3: Simulations of UIS on synthetic graphs. We estimate category sizes (top) and category
edge weights (bottom), using induced subgraph sampling (circles) and star sampling (stars).



Dataset                       |V|       |E|         k_V
Facebook: Texas [62]          36 364    1 590 651   87.5
Facebook: New Orleans [64]    63 392    816 885     25.8
P2P [40]                      62 561    147 877     4.7
Epinions [54]                 75 877    405 738     10.7

Table 1: Empirical topologies used in Sec. 6.3.



6.2.3 Category edge weights 

In the bottom row of Fig. 3, we use Eq. (8) and Eq. (9)
to estimate the category edge weights under induced
and star sampling designs, respectively.

Again, both estimators converge, with the star
estimator performing better than the induced one. As
before, the star estimator benefits from higher graph
density (Fig. 3(e)) and looser category structure (Fig. 3(f)).
However, in this case the induced estimator is affected
by these properties as well. Finally, in Fig. 3(g) we
compare the estimation efficiency of the low-weight edge
$e_{low}$ (defined as the edge with the 25th percentile weight)
with the high-weight edge $e_{high}$ (75th percentile). As
before, both estimators perform better for large estimated
values.

6.3 Empirically observed topologies 

6.3.1 Datasets 

We consider four fully known topologies described
in Table 1. We use two graphs extracted from Facebook
because (i) they significantly differ in density, and
(ii) Facebook is our focus in the experimental study
of Section 7.

In Section 6.2, we have seen that star sampling performs
the worst if categories are aligned with the
communities (dense clusters) existing in graphs. We decided
to simulate what is presumably the worst-case category
partition from the star sampling point of view. In particular,
we use a standard community finding algorithm based
on eigenvalues [47] to identify the 50 largest communities,
and define each such community to be a category.
All the remaining smaller communities (if any) are then
grouped together as the 51st category.

From these known graphs we then generate synthetic
datasets by three different sampling methods: UIS, RW
and S-WRW. Under S-WRW [35], we use equal category
weights for all categories, and we set $\tilde{f}_\ominus = 0$ (because
there are no irrelevant categories) and $\gamma = \infty$ (for
simplicity). As previously, our interest is in whether our
estimators (applied to these realistic samples) will
accurately reconstruct the true properties of the graphs
in question.

6.3.2 Category size 

We study the efficiency of the category size estimators
in the top row of Fig. 4. Due to lack of space, we
only report the median NRMSE across all categories. In
Fig. 3(d), this would correspond to the points on the
horizontal line $Y = 0.5$.

Figure 4: Simulations on empirically observed graphs. We estimate category sizes (top) and category
edge weights (bottom), using induced subgraph sampling (circles) and star sampling (stars).

The main observation is that, in contrast to Section 6.2,
here the induced estimators can outperform
the star estimators. This is particularly visible under
UIS, probably because of the highly skewed node degree
distributions. Such a distribution increases the variance
of the average degree estimator $\hat{k}_A$ that is used in the
star-based size estimation in Eq. (5).^4

However, in contrast to UIS, under RW and S-WRW 
star sampling usually performs better. This can also be 
explained by the highly skewed node degree distribu- 
tion. Indeed, because both RW and S-WRW visit high- 
degree nodes more often than UIS, their star samples 
inherently collect and exploit more information about 
neighbor categories, which translates to a better perfor- 
mance. This effect is similar to the better star sampling 
performance under higher graph density in Section 6.2.

6.3.3 Category edge weights 

While there is no clear winner in the category size
estimation, in the category edge weight estimation star
sampling consistently and significantly outperforms
induced sampling. Indeed, in Fig. 4(e-h), the induced
estimators often need 5-10 times more samples to achieve
^4 We might address this problem by modifying Eq. (5) to take,
e.g., $\hat{k}_A = \hat{k}_V$ or a similar model-based extension. Such
modifications may greatly reduce the variance of size estimation,
albeit at the cost of some bias. (Indeed, this is an example of
the classic "precision vs accuracy" tradeoff.) Note that such
modifications can allow us to use Eq. (5) to estimate $|A|$,
even if none of our sampled vertices were drawn from $A$.
Our initial experiments with such modifications have been
encouraging, but we do not treat them in depth here.



Dataset     Studied categories      Crawl type   % categ. samples   # total samples
2009 [20]   Regional (507)          MHRW09       34%                28x81K
            (34% of population)     RW09         41%                28x81K
                                    UIS09        34%                28x35K
2010 [35]   Colleges (10K+)         RW10         9%                 25x40K
            (3.5% of population)    S-WRW10      86%                25x40K

Table 2: Facebook datasets.



the same accuracy as star estimators. 

UIS clearly performs best, especially when estimating category sizes.
Not surprisingly, direct independence sampling should be preferred
whenever available. In more practical scenarios, however, we are
limited to exploration-based techniques. In our simulations, S-WRW is
consistently better than RW. Note that because all categories (and
thus nodes) are relevant, this advantage of S-WRW is purely due to
stratification. Moreover, the advantage of S-WRW increases with higher
heterogeneity of category sizes (not shown here), which is in
agreement with [35].

7. FACEBOOK CATEGORY GRAPHS 

In this section, we use the estimators developed in this 
paper to infer several category graphs from Facebook. 

7.1 Data sets 

In our previous work [20,35], we collected samples of Facebook users
(about 10.1 million total users), with publicly available information.
These datasets are summarized in Table 2 and are used as input for the
estimators of this paper. These datasets were collected using HTML
scraping, which allowed us to collect for each user v not only v's
category, but also the list of v's friends together with their
categories; i.e., we effectively collected a star sample of Facebook
users. By discarding the information about v's neighbors, we can also
use the induced subgraph estimators, for comparison.

Figure 5: Number of samples per category in Facebook 2009 (top) and
2010 (bottom). Horizontal axis: Facebook categories.

The 2009 data sets: These data sets were collected in April 2009 [20],
using three existing sampling techniques, UIS, MHRW and RW, as
summarized in Table 2. At that time, a Facebook user could be a member
of any of four different types of categories, called "networks" in the
Facebook terminology. Three of them, high school, college and
workplace, required passing a verification process, usually based on
an email account from the institution in question. The fourth
category, geographical region, did not require any verification, and
indicated the user's city, state or country. In this paper, we
consider the geographical region categories from the 2009 data sets.
Each dataset consisted of 100-1000 samples from each of the 507
geographical regions, as shown in Fig. 5(a); UIS collected about half
as many samples as each of the other two techniques.

The 2010 data sets: The geographical region category was phased out in
June 2009. Therefore, the data sets we collected in 2010 [35] contain
only the three remaining categories, from which we chose colleges as
the category studied in this paper. Furthermore, Facebook switched
from 32-bit to 64-bit userIDs, leading to a sparse userID space, which
made UIS impractical to apply. For this reason, in our 2010 Facebook
data sets we collected only a RW sample (because RW proved to
outperform MHRW [20,51]) as well as three variants of S-WRW [35]. A
full-length (1M) RW typically collected only 0-10 samples of a
particular college (Fig. 5(b)). This is because of a relatively small
college population (about 3.5%) and a large number of colleges (more
than 10,000). Fortunately, S-WRW, a technique designed to oversample
particular categories (here colleges), improves that result by at
least one order of magnitude.

Figure 6: Results for 100 most popular regional networks in 2009 (a,c)
and 100 college networks in 2010 (b,d): category size estimation
(a,b), and edge weight estimation (c,d).

7.2 Category graph estimation 

We present our results in Fig. 6. To calculate NRMSE, we use as ground
truth the average of the estimates over all samples for each crawl
type. In addition, we treat each of the 28 and 25 different walks, for
the 2009 and 2010 data sets respectively, as a different sample.
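
The evaluation procedure above can be sketched in a few lines of Python. This is only an illustration: it assumes the usual definition of NRMSE (root mean square error divided by the reference value), the function name `nrmse` is ours, and the numbers are made up. When no ground truth is supplied, the mean of the estimates stands in for it, mirroring the choice described in the text.

```python
import math


def nrmse(estimates, truth=None):
    """Normalized root mean square error of a list of estimates.

    If `truth` is None, the mean of the estimates is used as a stand-in
    for the ground truth (an assumption mirroring the text above).
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    if truth is None:
        truth = sum(estimates) / len(estimates)
    if truth == 0:
        raise ValueError("NRMSE is undefined for a zero reference value")
    mse = sum((x - truth) ** 2 for x in estimates) / len(estimates)
    return math.sqrt(mse) / abs(truth)


# Hypothetical size estimates of one category, one per independent walk.
walk_estimates = [90.0, 110.0, 100.0, 100.0]
print(round(nrmse(walk_estimates), 4))
```

Because the reference value is itself an average of estimates, the resulting NRMSE measures variability around that average rather than error against an independently known truth.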

7.2.1 Category size 

We show the results of Facebook category size estimation in
Fig. 6(a,b). Similarly to what we observed in the simulations in
Section 6, UIS performs the best, and S-WRW outperforms RW. MHRW
performs the worst, which was also expected given the recent studies
of MHRW in [20,51]. Under UIS, the induced estimator performs better.
Under RW and S-WRW, the star version is better, especially when
categories are small, as in the 2010 data set.

7.2.2 Category edge weight 

The estimation of category edge weights in Facebook, shown in
Fig. 6(c,d), also confirms the observations in the simulations of
Section 6. Indeed, all star estimators dramatically outperform their
induced counterparts. As before, the sampling techniques ordered from
best to worst are: UIS, S-WRW, RW and MHRW.

Finally, note that the NRMSEs in Fig. 6(a-d) are relatively high, even
under star (i.e., the better performing) sampling. This is because
these plots reach only relatively small sample sizes |S| (i.e., 25 or
28 times smaller than the entire sample at our disposal). Therefore,
one could extrapolate the plots in Fig. 6 by much more than a decade
to the right, further reducing the values of NRMSE. Moreover, in the
data sets that we eventually prepare, we combine several outcomes of
different, independent sampling techniques, which should further limit
the estimation variance. Therefore, the results in Fig. 6 should be
treated as a guideline about the relative efficiency of the sampling
techniques, rather than a comparison of the absolute values of NRMSE.

7.3 Geosocial visualization 

Finally, we have developed a highly customizable, web-based tool for
visualization of our Facebook category graphs. We have made a beta
version of the tool available at www.geosocialmap.com and invite the
reader to use it to experiment with the category graphs described in
this paper. The tool can be used to gain insight into the friendship
relations among these categories, as defined in Facebook.*

7.3.1 Cross-country friendships 

As mentioned earlier in Section 7.1, the 2009 data set contains the
geographical region information at various granularities, depending on
Facebook's penetration in that region. This may be either a user's
city or state (e.g., for USA, Canada, UK) or the entire country (more
typically).

As an example, we create the country-to-country friendship graph. To
this end, we first merged all categories coming from the same country.
Next, we estimated the sizes of the resulting categories. Because,
according to Fig. 6(a), the UIS induced sampling performed
exceptionally well, we used it in the category size estimation. These
sizes were then fed to the star estimators of category edge weights.
Finally, for every edge, we took the average of the three estimates
(resulting from UIS, MHRW and RW). Fig. 7(a) presents a subset of
"The world according to Facebook" graph.
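
The merging and averaging steps of this pipeline can be sketched as follows. This is a minimal illustration, not the code behind the published graphs: the function names, the region and user labels, and the edge weight values are all hypothetical, and treating an edge that is missing from one crawl as having weight 0 there is our assumption.

```python
from collections import defaultdict


def merge_categories(node_category, category_to_country):
    """Relabel each node's regional category with its country."""
    return {v: category_to_country[c] for v, c in node_category.items()}


def average_edge_weights(estimates):
    """Average several {(A, B): weight} dicts edge by edge.

    Assumption: an edge absent from one estimate counts as weight 0 there.
    """
    totals = defaultdict(float)
    for est in estimates:
        for edge, weight in est.items():
            totals[edge] += weight
    return {edge: total / len(estimates) for edge, total in totals.items()}


# Hypothetical users and regional categories.
node_region = {"u1": "Athens", "u2": "Crete", "u3": "London"}
region_country = {"Athens": "Greece", "Crete": "Greece", "London": "UK"}
node_country = merge_categories(node_region, region_country)

# Hypothetical star-based edge weight estimates, one dict per crawl.
per_crawl = [
    {("Greece", "UK"): 3e-6},  # UIS
    {("Greece", "UK"): 5e-6},  # MHRW
    {("Greece", "UK"): 4e-6},  # RW
]
final = average_edge_weights(per_crawl)
print(node_country, final)
```

Merging is done on the node labels before any estimation, so the size and edge weight estimators see countries as ordinary categories.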

7.3.2 North America 

For the USA and Canada, the 2009 data set contains the geographical
information at the granularity of 272 counties and provinces. This
allows us to create the North American friendship map. We followed the
same steps as in Section 7.3.1. An example is presented in Fig. 7(b).

*However, one should be careful about declaring categories in Facebook
as representative of the real world. First, Facebook attracts some age
groups more than others. Second, many Facebook users do not declare
(or hide) their category membership. Finally, a user might have
mistakenly chosen her category. For example, the third strongest link
for "Greece" is "Athens, GA", which is clearly mistaken for Athens,
Greece.

(a) Intra-continental country connections: Note the strong cliques
formed between Middle Eastern countries and South-East Asian
countries. There is no Facebook in China.

(b) North American regions: Physical distance is a major factor in the
United States (red), but seemingly less so in Canada (green).
Additionally, US and Canada are relatively weakly interconnected (thin
blue lines).

(c) Top 133 US colleges according to the "US News & World Report '09":
Physical distance is a major factor for public colleges (green), but
seemingly less so for private ones (red).

Figure 7: The friendship graph between regional networks. Available at
www.geosocialmap.com

7.3.3 US colleges 

Both the 2009 and 2010 data sets contain college categories. We chose
the 2010 data set to create a college-to-college friendship graph.
This data set consists of one RW sample and three S-WRW ones. Because
S-WRW performed much better than RW (see Fig. 6(b,d)), we decided to
use the three S-WRW samples only. Moreover, this time we estimated the
category sizes with the help of the star estimators, because they
performed better (Fig. 6(b)). Finally, as before, we fed the resulting
category sizes into the star estimators of category edge weights, and
we averaged the three S-WRW estimates into a final estimate. Fig. 7(c)
presents a subgraph of the resulting category graph.

8. RELATED WORK 

Node sampling in graphs. Most state-of-the-art crawling-based node
sampling techniques use variants of random walks (RW), such as the
classic RW [20,27,41,51,56], Metropolis-Hastings RW (MHRW)
[18,20,42,51,60], multiple dependent RW [52], multigraph RW [19], RW
with jumps [6,30,38,53], and weighted RW [35]. Based on the resulting
(uniform or non-uniform) sample of nodes, there exist principled
methods to estimate local graph properties (degree distribution,
assortativity and clustering coefficient). [34] is an excellent
introduction; other examples include [3,6,20,21,26,37,51-53,59]. In
our prior work [19,20,35], we used random-walk-based crawls to collect
user samples, which we use as input to the estimators proposed in this
paper.

Topology inference. Much classic work on inference for basic network
properties from node samples was done by Ove Frank and colleagues; see
particularly [13-15], which introduce Horvitz-Thompson estimators of
edge totals (i.e., volumes) from probability samples of nodes. Early
results involving topology inference from induced subgraph and star
sampling were reviewed by [16]. This prior work focused on the case of
known population and category sizes, and assumed without-replacement
designs.

Breadth First Search (BFS) has been used to sample 
topology e.g., in [4,43,44]. However, a BFS sample is 
known to introduce a strong bias towards high degree 
nodes [7,20,36,37,46,70], which makes it not represen- 
tative with respect to many metrics. Although this de- 
gree bias can often be significantly corrected for [36], 
the BFS sample covers only the neighborhood of the 
arbitrary starting node, which is not necessarily repre- 
sentative of the entire topology. 

[38] evaluates a number of sampling methods and the 
graphs they induce. The authors conclude that Forest 
Fire [39], intuitively a hybrid of RW and BFS, produces 
topology samples that resemble the original graph the 
most. However, Forest Fire is subject to the same biases 
as BFS described above. 

Another approach for inferring network structure is 
matrix completion of the distance matrix [10,68]. How- 
ever, this approach faces its own challenges when ap- 
plied to OSN samples. First, the distance matrix is 
typically high rank and one has to carefully identify a 
low rank structure [10]. Second, unlike traceroutes or 
tomographic techniques, crawling does not yield a ran- 
dom sample of distances [10,68]. 



Induced subgraph vs. star sampling. [34] is a good summary of these
two sampling designs. Induced subgraph sampling has been studied,
e.g., in [34,37,38]. Star sampling is similar to egonet sampling [66],
except that under star sampling we do not see edges between neighbors
of a sampled node. Our contribution here is to apply these measurement
schemes in the context of category graph estimation.

Block models and mixing rates. The use of partitions to produce
reduced-form versions of larger networks has an extensive history in
the social network literature, primarily under the label of "block
modeling"; see [49,66] for extensive reviews. Block models with known
partitions are sometimes called "confirmatory" block models, and have
been studied largely from a statistical point of view (e.g., [11,12]
and [66] ch. 16). Much of the latter interest is in modeling the edge
weights ("block densities" or "mixing rates") from covariates or other
information in a fully-observed context, with considerable additional
interest in the case where the network is observed but the categories
are latent [48,58]. Estimation of mixing rates from uniform node
samples for categories of known size is also a well-known problem
(see, e.g., [13,45,69]). Comparable methods for link-trace samples are
less well-developed, though see [17,24,27,28].

Although estimation of mixing rates from sampled 
data is relatively straightforward where categories are 
of known size and the number of categories |C| is fairly 
small (so that a random sample provides large num- 
bers of vertex pairs in each pair of categories), it is 
much more difficult when |C| is large and category sizes 
are not known. Our techniques thus extend the prior 
literature on block models and mixing rates to cases 
such as group interaction in OSNs and other large-scale 
social networks, in which one must estimate interac- 
tion among many groups of uncertain size from (typi- 
cally non-uniformly) sampled data. Our work also dif- 
fers from much recent social network literature in being 
design-based rather than model-based; design-based in- 
ference is frequently easier to employ than model-based 
inference, although both approaches have merits [61]. 

Facebook colleges. The Facebook social graph has been measured and
studied in the past. For example, [22] studies the interactions
between all 4.2M Facebook users in 492 universities in North America
between Feb 2004 and March 2006. (As a side note, the interpretation
is hindered by the full anonymization of users and universities.) [62]
studies the social structure within 100 Facebook college categories.
Given the above full datasets, one could compute the category graph
directly from its definition. In contrast, our methodological
contribution lies in estimating the category graph from a sample of
nodes, not from the fully known user graph.

Social graph visualization. There exist many tools that visualize
social graphs (including Facebook), for example [1,2,29].
www.geosocialmap.com differs from most of these tools in that it (i)
is category-centric (vs. user-centric), (ii) contains an aggregated
view of the entire Facebook population, (iii) is well suited for data
exploration (e.g., allows arbitrary selection of categories), and (iv)
accepts as input any weighted graph with an arbitrary set of node/edge
attributes (ongoing work).

9. CONCLUSION 

Estimation performance. In this paper, we derived a number of category
graph estimators for probability samples of nodes, uniform (Section 4)
and non-uniform (Section 5). We evaluated their performance in
simulation (Section 6) and on Facebook samples (Section 7). We showed
that they all converge to their true values for reasonable sample
sizes, a result we extend formally in the Appendix. Based on our
evaluation, we also provide recommendations, summarized as follows.
When estimating category sizes, there is no universal choice between
induced and star sampling. For example, the performance of the star
estimator improves (i) in dense graphs, (ii) in graphs with
homogeneous node degree distribution, (iii) in graphs with weaker
community structure, and (iv) under sampling techniques that
oversample high-degree nodes. In contrast, when estimating the
category edge weights, the star estimators are a clear winner; the
induced subgraph estimators often need 5-10 times more samples to
achieve the same accuracy. Finally, the sampling techniques strongly
affect estimator efficiency. They can be ordered from best to worst as
follows: UIS, S-WRW, RW and MHRW.

Potential applications. We applied our methodology to samples of
Facebook users and estimated potentially interesting category graphs,
such as the global friendship map and the friendship network of US
colleges. We visualized these weighted topologies and made them
publicly available at www.geosocialmap.com.

In addition to purely descriptive uses, the techniques described here
can also be employed as a first step towards model-based analysis.
Using the unnormalized edge weights together with the number of
possible edges within each cut yields the numbers of possible and
realized edges needed for likelihood-based analysis of interaction
probabilities. For instance, given additional features associated with
each category (e.g., for universities, their size, location, ranking,
and expense), one can model the inter-category mixing rates as a
function of category features (e.g., the effect of geographical
distance on tie probability). This permits both hypothesis testing for
putative theories of tie formation and ex ante prediction of
interaction rates among new or unobserved categories (given their
hypothesized features) for extremely large, incompletely observed
networks. Given the large and growing literature on statistical
modeling of networks (e.g., [8,23,31,32,50,55,63] among many others),
the potential for applications in this area is substantial.

Appendix: Consistency of the estimators

A desirable property of a statistical estimator is that of
consistency. A statistical estimator ($X_n$) is a function of the
sample size ($n = |S|$), and is said to be consistent if it converges
in probability ($\xrightarrow{p}$) to the true value of interest
($\theta$) [57] (which also implies asymptotic unbiasedness).
Formally: if $X_n \xrightarrow{p} \theta$ as $n \to \infty$, then
$X_n$ is said to be consistent for $\theta$. To prove the consistency
of the estimators in this paper we invoke two classic theorems in
probability: (1) the Law of Large Numbers, and (2) Slutsky's
Theorem,* which require the following assumptions. For the uniform
case we need to assume that the mean and variance are finite
($\theta < \infty$, $\sigma^2 < \infty$); for the non-uniform case we
need to make an additional assumption on the sampling weights so as to
guarantee the consistency of the Hansen-Hurwitz (HH) estimator,
specifically that the sum of the weights be bounded
($\sum_{v \in V} w(v) < \infty$).** Both of these conditions are
satisfied for finite graphs.

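LLN and Slutsky's Theorem

Theorem 10.1 (Law of Large Numbers (LLN)). Let $X_1, X_2, \ldots$ be
i.i.d. random variables with $E X_i = \theta$ and
$\mathrm{Var}\, X_i = \sigma^2 < \infty$. Then
$\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i \xrightarrow{p} \theta$.

Theorem 10.2 (Slutsky's Theorem). Let $X_n \xrightarrow{p} \alpha$ and
$Y_n \xrightarrow{p} \beta$, where $\alpha$ and $\beta$, respectively,
are real numbers. Then

(p.1) $X_n + Y_n \xrightarrow{p} \alpha + \beta$;

(p.2) $X_n \cdot Y_n \xrightarrow{p} \alpha \cdot \beta$;

(p.3) $X_n / Y_n \xrightarrow{p} \alpha / \beta$, where $\beta \neq 0$.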

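Uniform sampling estimators

Induced category size:
$$|\hat{A}| = N \cdot \frac{|S_A|}{|S|} = N \cdot \frac{1}{n} \sum_{v \in S} 1_{\{v \in A\}} \xrightarrow{p} |A|$$
by the LLN.

Average node degrees:
$$\hat{k}_V = \frac{1}{|S|} \sum_{v \in S} \deg(v) \xrightarrow{p} k_V \quad \text{and} \quad \hat{k}_A = \frac{1}{|S_A|} \sum_{v \in S_A} \deg(v) \xrightarrow{p} k_A$$
by the LLN (as above).

Relative category volume:
$$\hat{f}^{vol}_A = \frac{1}{\mathrm{vol}(S)} \sum_{s \in S} \sum_{v \in \mathcal{N}(s)} 1_{\{v \in A\}} \xrightarrow{p} f^{vol}_A$$
by an application of the LLN to both the numerator and denominator,
separately, followed by an application of Slutsky's Theorem (p.3).

Star category size:
$$|\hat{A}| = N \cdot \hat{f}^{vol}_A \cdot \frac{\hat{k}_V}{\hat{k}_A} \xrightarrow{p} |A|$$
by two applications of Slutsky's Theorem (p.2 and p.3) and consistency
of the individual estimators.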



*For more details about these two theorems see [57].

**There are some alternate assumptions on the weights that can be made
to guarantee convergence.



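Induced edge weight:
$$\hat{w}(A,B) = \frac{\sum_{a \in S_A} \sum_{b \in S_B} 1_{\{\{a,b\} \in E\}}}{\sum_{a \in S} \sum_{b \in S} 1_{\{a \in S_A \text{ and } b \in S_B\}}} \xrightarrow{p} \frac{|E_{A,B}|}{|A| \cdot |B|} = w(A,B)$$
by the LLN and Slutsky's Theorem (p.3).

Star edge weight:
$$\hat{w}(A,B) = \frac{\sum_{a \in S_A} |E_{a,B}| + \sum_{b \in S_B} |E_{b,A}|}{|S_A| \cdot |B| + |S_B| \cdot |A|} = \frac{\frac{1}{n} \sum_{a \in S_A} |E_{a,B}| + \frac{1}{n} \sum_{b \in S_B} |E_{b,A}|}{\frac{1}{n} |S_A| \cdot |B| + \frac{1}{n} |S_B| \cdot |A|} \xrightarrow{p} \frac{|E_{A,B}| + |E_{A,B}|}{|A||B| + |A||B|} = w(A,B)$$
by the LLN on numerator and denominator and then by five applications
of Slutsky's Theorem (p.1, p.2, and p.3).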
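
As a sanity check on the uniform-sampling estimators above, the following Python sketch evaluates three of them on a small hand-made graph. The graph, the category labels and all function names are illustrative assumptions, not taken from the paper's code. With a census sample (every node drawn exactly once) the estimators return the exact values, which is the limiting behavior that the consistency argument describes.

```python
def induced_size(sample, category, A, N):
    """Induced estimate of |A|: N * |S_A| / |S|."""
    s_a = sum(1 for v in sample if category[v] == A)
    return N * s_a / len(sample)


def induced_weight(sample, category, adj, A, B):
    """Induced estimate of w(A,B): fraction of sampled (a, b) pairs joined by an edge."""
    s_a = [v for v in sample if category[v] == A]
    s_b = [v for v in sample if category[v] == B]
    if not s_a or not s_b:
        return None  # undefined without samples from both categories
    hits = sum(1 for a in s_a for b in s_b if b in adj[a])
    return hits / (len(s_a) * len(s_b))


def star_weight(sample, category, adj, A, B, size_A, size_B):
    """Star estimate of w(A,B), given known or estimated category sizes."""
    s_a = [v for v in sample if category[v] == A]
    s_b = [v for v in sample if category[v] == B]
    denom = len(s_a) * size_B + len(s_b) * size_A
    if denom == 0:
        return None
    # |E_{a,B}|: number of neighbors of a that belong to B (and vice versa).
    num = sum(sum(1 for u in adj[a] if category[u] == B) for a in s_a)
    num += sum(sum(1 for u in adj[b] if category[u] == A) for b in s_b)
    return num / denom


# Toy graph: A = {0, 1, 2}, B = {3, 4, 5}; three A-B edges, so w(A,B) = 3/9.
adj = {0: {1, 3}, 1: {0, 3}, 2: {5}, 3: {0, 1, 4}, 4: {3}, 5: {2}}
category = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
census = list(adj)

print(induced_size(census, category, "A", N=6))
print(induced_weight(census, category, adj, "A", "B"))
print(star_weight(census, category, adj, "A", "B", size_A=3, size_B=3))
```

On the two-node sample `[0, 3]` the induced estimator sees only the single pair (0, 3), whereas the star estimator also uses the categories of all neighbors of nodes 0 and 3; this extra information per sampled node is what the text credits for the better performance of star sampling on edge weights.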



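Non-uniform sampling estimators

Eq. (10): the Hansen-Hurwitz estimator $\hat{x}_{tot}$ is shown to be
consistent in [25].

Category size:
$$|\hat{A}| = N \cdot \frac{\frac{1}{n} w^{-1}(S_A)}{\frac{1}{n} w^{-1}(S)} \xrightarrow{p} |A|$$
by the consistency of the HH estimator in the numerator and
denominator and then by Slutsky's Theorem (p.3).

Average node degrees:
$$\hat{k}_V = \frac{\frac{1}{n} \sum_{v \in S} \frac{\deg(v)}{w(v)}}{\frac{1}{n} w^{-1}(S)} \xrightarrow{p} k_V \quad \text{and} \quad \hat{k}_A = \frac{\frac{1}{n} \sum_{v \in S_A} \frac{\deg(v)}{w(v)}}{\frac{1}{n} w^{-1}(S_A)} \xrightarrow{p} k_A$$
by the consistency of the HH estimator and Slutsky's Theorem (p.3).

Relative category volume:
$$\hat{f}^{vol}_A = \frac{\frac{1}{n} \sum_{s \in S} \left( \frac{1}{w(s)} \sum_{v \in \mathcal{N}(s)} 1_{\{v \in A\}} \right)}{\frac{1}{n} \sum_{s \in S} \frac{\deg(s)}{w(s)}} \xrightarrow{p} f^{vol}_A$$
by the consistency of the HH estimator and Slutsky's Theorem (p.3).

Star category size:
$$|\hat{A}| = N \cdot \hat{f}^{vol}_A \cdot \frac{\hat{k}_V}{\hat{k}_A} \xrightarrow{p} |A|$$
by the consistency of the estimators and Slutsky's Theorem (p.2 and
p.3).

Induced edge weight:
$$\hat{w}(A,B) = \frac{\frac{1}{n^2} \sum_{a \in S_A} \sum_{b \in S_B} \frac{1_{\{\{a,b\} \in E\}}}{w(a) \cdot w(b)}}{\frac{1}{n} w^{-1}(S_A) \cdot \frac{1}{n} w^{-1}(S_B)} \xrightarrow{p} w(A,B)$$
by the consistency of the HH estimator and Slutsky's Theorem (p.2 and
p.3).

Star edge weight:
$$\hat{w}(A,B) = \frac{\frac{1}{n} \sum_{a \in S_A} \frac{|E_{a,B}|}{w(a)} + \frac{1}{n} \sum_{b \in S_B} \frac{|E_{b,A}|}{w(b)}}{\frac{1}{n} w^{-1}(S_A) \cdot |B| + \frac{1}{n} w^{-1}(S_B) \cdot |A|} \xrightarrow{p} \frac{|E_{A,B}| + |E_{A,B}|}{|A||B| + |A||B|} = w(A,B)$$
by the consistency of HH estimators in the numerator and denominator
and then by five applications of Slutsky's Theorem (p.1, p.2 and p.3).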
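
The effect of the reweighting in the non-uniform estimators can be illustrated with a small Python sketch. It assumes that $w^{-1}(X)$ denotes the sum of $1/w(v)$ over the sampled multiset $X$; the function names and the toy data are hypothetical. The sample is built so that each node appears exactly $w(v)$ times, an idealized stand-in for a long weighted sample, in which case the reweighted estimates are exact while the unweighted one is biased.

```python
def inv_weight_sum(nodes, w):
    """w^{-1}(X): sum of 1/w(v) over the multiset X (assumed notation)."""
    return sum(1.0 / w[v] for v in nodes)


def weighted_size(sample, category, w, A, N):
    """Reweighted estimate of |A|: N * w^{-1}(S_A) / w^{-1}(S)."""
    s_a = [v for v in sample if category[v] == A]
    return N * inv_weight_sum(s_a, w) / inv_weight_sum(sample, w)


def weighted_mean_degree(sample, degree, w):
    """Reweighted estimate of k_V: sum of deg(v)/w(v) divided by w^{-1}(S)."""
    return sum(degree[v] / w[v] for v in sample) / inv_weight_sum(sample, w)


# Toy population of N = 5 nodes; sampling weights equal to node degrees.
category = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B"}
degree = {0: 4, 1: 2, 2: 1, 3: 1, 4: 2}
w = dict(degree)
N = len(category)

# Idealized weighted sample: node v appears exactly w(v) times.
sample = [v for v in w for _ in range(w[v])]

naive_size = N * sum(category[v] == "A" for v in sample) / len(sample)
print(naive_size)                                   # ignores the weights
print(weighted_size(sample, category, w, "A", N))   # corrects for them
print(weighted_mean_degree(sample, degree, w))
```

Only ratios of weighted sums appear, so the weights need to be known only up to a multiplicative constant.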

A note on dependent samples 

These results continue to hold in the case of dependent (correlated)
samples, such as RW, under the condition that these samples converge
asymptotically to UIS or WIS limits. This follows from the ergodic
theorem, which provides a corresponding LLN for convergent Markov
chains. For more technical details on the LLN in the context of
dependent samples see [52].
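
A short simulation can make this note concrete. The sketch below runs a simple random walk on a tiny connected, non-bipartite graph and estimates a category size with and without degree reweighting. The graph, the category labels, the walk length and the seed are arbitrary illustrative choices, and the use of $w(v) = \deg(v)$ relies on the standard fact that a simple random walk visits nodes asymptotically in proportion to their degree.

```python
import random


def random_walk(adj, start, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    v, visited = start, []
    for _ in range(steps):
        v = rng.choice(adj[v])
        visited.append(v)
    return visited


# Toy graph with edges 0-1, 0-2, 0-3, 1-2, 2-3, 3-4 (contains a triangle).
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3]}
category = {0: "B", 1: "A", 2: "B", 3: "B", 4: "A"}  # true |A| = 2
N = len(adj)

rng = random.Random(0)
walk = random_walk(adj, start=0, steps=200_000, rng=rng)

# Reweight each visit by 1/deg(v), i.e. use w(v) = deg(v).
inv_all = sum(1.0 / len(adj[v]) for v in walk)
inv_A = sum(1.0 / len(adj[v]) for v in walk if category[v] == "A")
size_A_reweighted = N * inv_A / inv_all

# Treating the walk as if it were a uniform sample under-counts
# the low-degree nodes that make up category A.
size_A_naive = N * sum(category[v] == "A" for v in walk) / len(walk)

print(size_A_reweighted, size_A_naive)
```

Although consecutive samples of the walk are correlated, the reweighted estimate should settle near the true size of 2, while the unweighted one should settle near 5 · 3/12 = 1.25, the share of A in the degree-proportional stationary distribution.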
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